Fitting Logistic Regression Models

$\newcommand{\cond}{{\mkern+2mu} \vert {\mkern+2mu}} \newcommand{\SetDiff}{\mathrel{\backslash}} \DeclareMathOperator{\BetaFunc}{Β} \DeclareMathOperator{\GammaFunc}{Γ} \DeclareMathOperator{\prob}{p} \DeclareMathOperator{\cost}{J} \DeclareMathOperator{\score}{V} \DeclareMathOperator{\dcategorical}{Categorical} \DeclareMathOperator{\ddirichlet}{Dirichlet}$

"The logistic regression model arises from the desire to model the posterior probabilities of the $K$ classes via linear functions in $x$, while at the same time ensuring that they sum to one and remain in $[0, 1]$.", Hastie et al., 2009 (p. 119).

$$ \begin{align} \log \frac{\prob(Y = k \cond X = x)}{\prob(Y = K \cond X = x)} = \beta_k^{\text{T}} x && \text{for } k = 1, \dotsc, K-1. \end{align} $$

The probability of the reference class $Y = K$ is not modeled directly: the probabilities must sum to one, so only $K-1$ of them are free. Thus when there are two classes, the model reduces to a single linear function.

This gives us that $$ \begin{align} \prob(Y = k \cond X = x) &= \frac{\exp(\beta_k^{\text{T}}x)}{1 + \sum_{i=1}^{K-1} \exp(\beta_i^{\text{T}}x)} & \text{for } k = 1, \dotsc, K-1 \\[3pt] \prob(Y = K \cond X = x) &= \frac{1}{1 + \sum_{i=1}^{K-1} \exp(\beta_i^{\text{T}}x)}. \end{align} $$

Note that if we fix $\beta_K = 0$, we have the form $$ \begin{align} \prob(Y = k \cond X = x) &= \frac{\exp(\beta_k^{\text{T}}x)}{\sum_{i=1}^K \exp(\beta_i^{\text{T}}x)} & \text{for } k = 1, \dotsc, K. \end{align} $$

Then, writing $\beta = \{\beta_1^{\text{T}}, \dotsc, \beta_{K}^{\text{T}}\}$ for the complete parameter set, we abbreviate $\prob(Y = k \cond X = x) = \prob_k(x; \beta)$.

The log likelihood for $N$ observations is $$ \ell(\beta) = \sum_{n=1}^N \log \prob_{y_n}(x_n; \beta). $$

Maximizing the likelihood is equivalent to minimizing the cost function below, which is just $-\ell(\beta)/N$ rewritten with the indicator $1[y_n = k]$ selecting the observed class: $$ \begin{align} \cost(\beta) &= -\frac{1}{N} \left[ \sum_{n=1}^N \sum_{k=1}^K 1[y_n = k] \log \prob_k(x_n; \beta) \right] \\ &= -\frac{1}{N} \left[ \sum_{n=1}^N \sum_{k=1}^K 1[y_n = k] \log \frac{\exp(\beta_k^{\text{T}}x_n)}{\sum_{i=1}^K \exp(\beta_i^{\text{T}}x_n)} \right] \\ &= -\frac{1}{N} \left[ \sum_{n=1}^N \sum_{k=1}^K 1[y_n = k] \left( \beta_k^{\text{T}}x_n - \log \sum_{i=1}^K \exp(\beta_i^{\text{T}}x_n) \right) \right]. \end{align} $$

Differentiating, and using the fact that $\nabla_{\beta_k} \log \sum_{i=1}^K \exp(\beta_i^{\text{T}} x_n) = \prob_k(x_n; \beta) \, x_n$, the score function is $$ \score_k(\beta) = \nabla_{\beta_k} \cost(\beta) = -\frac{1}{N} \left[ \sum_{n=1}^N x_n \big( 1[y_n = k] - \prob_k(x_n; \beta) \big) \right]. $$